ABSTRACT

In  this  communication,  we  address  multilingual  interoperability 
aspects  in  speech  recognition.  After  giving  a tentative  defini- 
tion of  multilingual  interoperability,  we  discuss  speech  recogni- 
tion components  and  their  language-specific  aspects.  We  give 
a sample  overview  of  past  multilingual  speech  recognition  re- 
search and  development  across  different  speaking  styles  (read, 
prepared and conversational). The problem of adaptation to new
languages is addressed. Language-independent and cross-language
techniques for acoustic modeling provide a means to port recognition
systems to new languages without language-specific acoustic data.
Pronunciation lexica and text material appear to be the most crucial
language-dependent resources for porting. Since fast porting is a step
towards multilingual interoperability, the ongoing efforts to produce
multilingual pronunciation lexica and to collect multilingual text
corpora should be extended to the largest possible number of written
languages.


1.  INTRODUCTION 

The important progress achieved in speech recognition over the last
decades has led to successful demos using speech technology. Demos
raise expectations when shown to potential users, yet only a few
systems are ready for operational use. In a multilingual environment,
where potential users have distinct native languages, speech
recognition systems have to deal with these different languages or, if
a common language is shared, with non-native speaker accents.
Multilingual environments are common in international communication
contexts, which may be political, military, scientific, commercial or
tourist contexts. The development of multilingual recognition and
spoken dialog systems is hence an important research issue, opening a
large spectrum of potential applications. To increase the usability of
a prototype system, the problems of multilingual and non-native speech
have to be addressed efficiently.

Speech recognizers are still very sensitive to non-native speech input
or, more generally, to any kind of condition mismatch. Porting a given
system to a new language often requires a significant amount of
language-specific knowledge and resources before viable recognition
results are achieved. Multilingual corpora have been gathered for
language identification and multilingual recognition research (OGI-TS,
LDC CallHome, GlobalPhone...). Research and development in
multilingual recognition has been widely supported by the European
Communities (EC) and the Defense Advanced Research Projects Agency
(DARPA) [39, 5, 12, 40, 14, 43].

In  this  contribution  we  address  issues  of  multilinguality  and 
multilingual  interoperability  in  speech  recognition. 

Using a standard recognizer architecture based on acoustic HMM phone
models, pronunciation dictionaries and word N-gram language models,
the language-specific aspects of each component are discussed. Many
observations are drawn from our experience at LIMSI in developing
multilingual speech recognizers [35, 54, 2, 1, 4]. We then focus on
multilingual recognition systems. Without attempting to be exhaustive,
we try to give an overview of some representative research actions in
multilingual and cross-lingual speech recognition.

2.  MULTILINGUALITY  AND  MULTILINGUAL 
INTEROPERABILITY 

There exist about 3000 different spoken languages, not counting
dialects, at the end of this millennium [38]. According to this
author, only several hundred languages also have a significant written
production, to which current speech recognition systems
(speech-to-text systems) are applicable. Studies in automatic speech
recognition (ASR) are presently limited to about 20 languages,
including English, Arabic, Chinese, Japanese, Spanish, French, German,
Italian, Portuguese, Greek, Swedish, Danish, Dutch...

Interoperability is a term widely used in product marketing
descriptions: products achieve interoperability with other products
either by adhering to published interface standards (example: the Web,
with standards such as TCP/IP, HTTP and HTML) or by making use of a
"broker" of services that can convert one product's interface into
another product's interface on the fly (example: the Common Object
Request Broker Architecture, CORBA). Interoperability is becoming a
quality of increasing importance for information technology products,
and naturally the demand for interoperability of speech technology
products arises. Voice over IP (VoIP) protocols have already evolved
into world-wide standards (IETF's SIP, ITU's H.323) to support the
emerging voice, data and video services of the next millennium.

For speech recognition systems the term interoperability is not yet
commonly used in the corresponding research community. Nonetheless
many past or present research actions aim at defining standards for
text and speech processing (e.g. the EC Eagles project on language
engineering standards [26]), at developing multilingual resources
([51, 45, 15, 12, 5]), at establishing multilingual recognizer
evaluations (e.g. the EC SQALE project on multilingual speech
recognition evaluation, the DARPA Hub5 program on conversational
multilingual speech), and at achieving greater robustness across
varying experimental conditions (e.g. the DARPA Hub3 program and Hub4
broadcast news transcriptions). Research towards better multilingual
interoperability is supported and fostered by national and
international institutions: EC (European Commission), NSF (National
Science Foundation), DARPA...

Multilingual interoperability, which is the topic of this workshop,
deals with the problem of designing speech products which are
operative in a multilingual context and/or easily portable to new
languages. The development of multilingual corpora and resources can
be considered as a milestone on the way to multilingual
interoperability. Developing such resources, however, is
time-consuming and expensive, and their reusability is not always
ensured when moving to new application domains. Important related
research areas concern cross-domain portability. Research directions
towards more language-independent approaches for speech recognition
are also being investigated [47, 32, 31], especially for acoustic
modeling.

3.  SPEECH  RECOGNITION 

We briefly review the main components of the recognizer in the
statistical approach commonly used for LVSR (Large Vocabulary Speech
Recognition) [6], [27], [53] and discuss to what extent these
components are language-specific. The speech recognizer has to
determine the most probable word sequence $\hat{w}_1^n$ given the
acoustic input $x_1^T$:

$$\hat{w}_1^n = \arg\max_{w_1^n} \Pr(w_1^n) \, \Pr(x_1^T \mid w_1^n)$$

where $w_1^n$ is a sequence of $n$ words, each in the lexicon, $n$
being a positive integer. The acoustic input $x_1^T$ is a feature
stream, chosen so as to reduce model complexity while trying to keep
the relevant information (i.e. the linguistic information for the
speech recognition problem). While the use of language-dependent
acoustic features has been investigated (see the dedicated session of
ICSLP'98), acoustic parameter extraction can be considered as mostly
language-independent.

$\Pr(w)$ is provided by a language model, and $\Pr(x \mid w)$ by an
acoustic model. The recognition decision is taken as a joint
optimization of two terms: $\Pr(w)$, the a priori probability of a
word or a word sequence as given by the language model, and
$\Pr(x \mid w)$, the conditional probability of the signal
corresponding to the word sequence, given by the acoustic model. The
output $\hat{w}_1^n$ is a sequence of items from the vocabulary
$\{w_i\}$. Pronounced items which are not in the lexicon (referred to
as out-of-vocabulary words or OOVs) are necessarily missing from the
recognizer's output, and are thus misrecognized. Hence the motivation
for maximizing lexical coverage by appropriate definition and
selection of the lexical items during training.
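
The decision rule above can be illustrated with a minimal sketch. It assumes that a small list of candidate word sequences is already available and that the language model and acoustic model are given as two scoring functions returning log probabilities; the function name `decode` and the toy scores are hypothetical and only serve the illustration. A real decoder searches a much larger hypothesis space.

```python
import math

def decode(candidates, lm_logprob, am_loglik):
    """Return the candidate word sequence w maximizing
    log Pr(w) + log Pr(x | w), i.e. the product Pr(w) Pr(x | w)."""
    return max(candidates, key=lambda w: lm_logprob(w) + am_loglik(w))

# Toy scores for two competing hypotheses (hypothetical values).
LM = {
    ("recognize", "speech"): math.log(0.6),
    ("wreck", "a", "nice", "beach"): math.log(0.1),
}
AM = {
    ("recognize", "speech"): math.log(0.2),
    ("wreck", "a", "nice", "beach"): math.log(0.3),
}

best = decode(list(LM), LM.get, AM.get)
```

Working in the log domain turns the product of the two terms into a sum, which avoids numerical underflow for long utterances.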

• the acoustic model $\Pr(x \mid w)$

Acoustic units generally correspond to subword units which, when
compared with word models, reduce the number of parameters and enable
cross-word modeling and porting to new vocabularies in a monolingual
context. For Hidden Markov Model (HMM) based systems, acoustic
modeling most commonly makes use of context-dependent (CD) phone units
(see footnote 1). $\Pr(x \mid w)$ is then obtained via a pronunciation
lexicon, where each word $w_i$ is described as a sequence of the
appropriate phones:

$$\Phi(w_i) = \phi_1 \oplus \phi_2 \oplus \cdots \oplus \phi_m$$
$$\Pr(x \mid w_i) = \Pr(x \mid \Phi(w_i)) = \Pr(x \mid \phi_1 \oplus \phi_2 \oplus \cdots \oplus \phi_m)$$
Consistent use of the different phone symbols in the lexicon is
probably the most important requirement in pronunciation generation.
CD models allow for implicit coarticulation modeling within the
acoustic model. Coarticulation due to the surrounding phones
necessarily occurs in all languages, and hence context modeling should
be an effective approach for any language. As context-independent (CI)
models merge all different coarticulation effects within the same
model, they are more robust than CD models. Separating coarticulation
effects using an increasing number of contexts results in a more
accurate representation of the acoustic patterns. CD models,
accounting for the phonotactic constraints of the language, are hence
more language-specific than CI models. Concerning the acoustic phone
models (CI or CD), we have to be aware that they always best model the
most frequently observed coarticulation effects of the training data.
For training corpora with a low lexical variety, CI phone models tend
to become word-dependent, with possibly poor generalization abilities,
both within and across languages.
Language-dependent CI models (and even, recently, context-dependent
phone models [31]) have been experimented with for porting a
recognizer to new languages.

To overcome the problem of unobserved sounds when porting acoustic
models to a new language, studies aiming at developing multilingual or
language-independent acoustic phone models are being undertaken both
for speech recognition and for language identification. Recent
research on language-independent acoustic phone models and
cross-language adaptation can be found in [47, 32, 31, 16]. These
studies tend to demonstrate the viability of a language-independent
acoustic modeling approach. Whereas it is important to be able to
bootstrap a recognizer for a new language without prior acoustic
models of that language, most researchers nonetheless tend to conclude
that using a small amount of language-specific acoustic data, either
to train language-dependent models or to carry out a
language-dependent adaptation, rapidly outperforms the use of foreign
language data alone. MLLR [37] and MAP adaptation techniques are used
for adapting cross-lingual or multilingual acoustic models to the new
language.

1 In some real-time systems context-independent (CI) phone units may
be used in order to reduce the computation time and search space.

• the language model $\Pr(w)$

Language models are used to model regularities in natural language,
and can therefore be used in speech recognition to predict probable
word sequences during decoding. The most popular methods, such as
statistical n-gram models, attempt to capture the syntactic and
semantic constraints by estimating the frequencies of sequences of $n$
words. The lexical unit, $w_i$, can be considered the basic
observation for statistical language models. The extraction of $w_i$
units from text sources can be more or less straightforward depending
on the language (e.g. easy for English or French, difficult for
Japanese, which has no spacing between words). Given a fixed amount of
training data, less reliable language models (LMs) are usually
obtained for highly inflected languages (with large lexical variety)
than for less inflected languages. The same observation can be made
for agglutinative languages. In the latter case decompounding could be
applied for lexical unit definition. Tokenizations or text
normalizations aimed at reducing lexical variety include some
language-independent processing and a variable amount of more or less
complex language-dependent processing [1, 24].

The  effectiveness  of  N-gram  LMs  for  a given  language 
also  depends  on  the  validity  of  the  approximation  of  cap- 
turing the  language  structure  within  sequences  of  N words. 
We  know  that  the  validity  of  this  approximation  is  strongly 
language-dependent,  and  hence  the  N-gram  modeling  ap- 
proach will  not  give  the  same  benefit  to  speech  recogni- 
tion systems  for  all  languages,  even  if  no  limit  on  available 
training  data  were  imposed. 

• the decoder $\arg\max_{w_1^n}$

The  search  space  to  be  explored  by  the  decoder  is  related  to 
the  lexicon  size  and  the  language  model  (LM)  complexity. 
For  a bigram  LM  the  search  space  is  proportional  to  the 
lexicon  size.  Pronunciation  variants  introduce  additional 
entries  in  the  search  space.  Computational  requirements 
can  be  controlled  by  limiting  LM  size,  lexicon  size  and 
pronunciation  variants. 
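
To make the role of the pronunciation lexicon and of context-dependent units more concrete, the following minimal sketch expands a word sequence into its phone sequence and then into triphone labels. The two-word lexicon, the phone symbols, the `left-center+right` label notation and the `sil` boundary symbol are all assumptions chosen for illustration; they are not taken from any system discussed here.

```python
# Hypothetical pronunciation lexicon: word -> phone sequence.
LEXICON = {
    "speech": ["s", "p", "iy", "ch"],
    "rate": ["r", "ey", "t"],
}

def phones_for(words, lexicon):
    """Concatenate the lexicon pronunciations of a word sequence."""
    phones = []
    for word in words:
        if word not in lexicon:
            # An out-of-vocabulary word has no acoustic model sequence.
            raise KeyError(f"out-of-vocabulary word: {word}")
        phones.extend(lexicon[word])
    return phones

def to_triphones(phones, boundary="sil"):
    """Label each phone with its left and right neighbours."""
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{left}-{center}+{right}"
        for left, center, right in zip(padded, padded[1:], padded[2:])
    ]

triphones = to_triphones(phones_for(["speech", "rate"], LEXICON))
```

Because contexts are taken over the concatenated phone string, the unit `iy-ch+r` spans the word boundary, which is the cross-word modeling mentioned above. The number of distinct context-dependent units grows quickly with the phone inventory, which is one reason they are more language-specific than context-independent units.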

A speech recognizer should meet the following requirements to
guarantee good performance. The vocabulary and the acoustic and
language models have to achieve good coverage of the system's
operating conditions. The vocabulary should thus contain all or most
words likely to appear during operation, which means that the
out-of-vocabulary (OOV) word rate should be minimal. Acoustic models
should be able to accurately model the vocabulary words.
Context-dependent models allowing for a high coverage of the
vocabulary are likely to produce better results than
context-independent models or contextual models which are seldom
observed during operation. Similarly, language models should produce
low perplexity during operation. The same criteria have to be met by
multilingual systems.
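
As an illustration of the perplexity criterion, the sketch below estimates a bigram model from a few tokenized sentences and measures its perplexity on test sentences. Add-one smoothing and the `<s>` and `</s>` boundary symbols are simplifying assumptions made for brevity, and the test words are assumed to be in the training vocabulary; the systems discussed in this paper use more elaborate estimation techniques.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Collect history and bigram counts from tokenized sentences."""
    histories, bigrams = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        histories.update(toks[:-1])          # every token that has a successor
        bigrams.update(zip(toks, toks[1:]))
    vocab = {w for sent in sentences for w in sent} | {"</s>"}
    return histories, bigrams, vocab

def perplexity(sentences, model):
    """Perplexity of an add-one smoothed bigram model on test sentences."""
    histories, bigrams, vocab = model
    log_sum, n = 0.0, 0
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        for h, w in zip(toks, toks[1:]):
            p = (bigrams[(h, w)] + 1) / (histories[h] + len(vocab))
            log_sum += math.log(p)
            n += 1
    return math.exp(-log_sum / n)

model = train_bigram([["a", "b"], ["a", "c"]])
matched = perplexity([["a", "b"]], model)
mismatched = perplexity([["b", "a"]], model)
```

A word order seen in training (`matched`) yields a lower perplexity than an unseen order (`mismatched`), which is the sense in which a language model should "fit" the operating conditions.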

4.  MULTILINGUAL  SPEECH  RECOGNITION 

Ideally a multilingual speech recognizer is able to transcribe speech
from different languages, thus identifying both the language used and
the word sequence uttered by the speaker. Whereas language and word
string can be identified in parallel (multilingual recognizer), a more
effective way, at least for now, is to first identify the language
using a language identification system on homogeneous acoustic
segments, and then to decode the word string with the appropriate
language-dependent recognizer.
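
The two-stage strategy can be sketched as a simple dispatch. In this minimal sketch the language identifier and the per-language recognizers are passed in as plain functions; their names and signatures are hypothetical placeholders, not the interface of any existing system.

```python
def transcribe(segments, identify_language, recognizers):
    """Identify the language of each homogeneous segment, then decode it
    with the recognizer of that language.

    Returns a list of (language, transcription) pairs; the transcription
    is None when no recognizer exists for the identified language."""
    results = []
    for segment in segments:
        language = identify_language(segment)
        recognizer = recognizers.get(language)
        text = recognizer(segment) if recognizer is not None else None
        results.append((language, text))
    return results

# Usage with stand-in components (illustration only).
def fake_lid(segment):
    return segment["lang"]

fake_recognizers = {
    "en": lambda seg: "hello",
    "fr": lambda seg: "bonjour",
}

output = transcribe(
    [{"lang": "fr"}, {"lang": "en"}, {"lang": "de"}],
    fake_lid,
    fake_recognizers,
)
```

One consequence of this design is that a language identification error cannot be recovered by the second stage, whereas a parallel multilingual recognizer defers the language decision to the decoder.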


Existing systems have been developed for specific domains and a
restricted number of languages, requiring large amounts of annotated
language-specific corpora. Without trying to be exhaustive, we can
cite some examples of multilingual recognizer developments: the
LE-SQALE project on read speech LVSR in English, German and French
[35, 54], and the DARPA Hub5 program on conversational and
multilingual LVCSR (Large Vocabulary Conversational Speech
Recognition) over the telephone [9, 12], using the Switchboard and
CallHome corpora.

4.1. Multilingual LVSR using read speech

The aim of the EC SQALE project (Speech recognizer Quality Assessment
for Linguistic Engineering) was to experiment with establishing in
Europe a multilingual evaluation paradigm for the assessment of large
vocabulary, continuous speech recognition systems (LVSR), and to
assess language-dependent issues in multilingual recognizer
evaluation. This project, running from 1993 to 1995, gathered CUED
Cambridge (UK), Philips Aachen (Germany), LIMSI Paris (France) and TNO
Soesterberg (Netherlands).

In the SQALE project, the same system was evaluated on comparable
tasks in different languages (American English, British English,
French and German) to determine cross-lingual differences. The
recognizer makes use of phone-based continuous density HMMs for
acoustic modeling and n-gram statistics estimated on newspaper texts
for language modeling. The system was evaluated on a dictation task
developed with read, newspaper-based corpora: the ARPA Wall Street
Journal corpus of American English, the WSJCAM0 corpus for British
English, the BREF-Le Monde corpus of French and the
PHONDAT-Frankfurter Rundschau corpus for German. Experimental results
under closely matched conditions are reported. The average word
accuracy across all 4 languages is about 85%, obtained for a 20k
vocabulary open test (65k open test for German) on a multilingual test
set where the OOV rates are kept comparable across languages (about 2%
OOVs). Trigram LMs and context-dependent acoustic models were used
(about 800 CD models for French and more than 2500 tied-state CD
models for English and German). A similar recognizer was developed in
Japan [42] using 180M words of business newspaper text. With a 7k
vocabulary and an appropriate 7k test set without OOV words, an 80%
word accuracy rate is achieved using a bigram LM and about 700 CD
models.

In Tab. 1, lexical variety across different languages is investigated
for comparable amounts of text corpora (see footnote 2). Coverage
figures for Japanese reported in [42] are very close to those obtained
for Italian. Whereas English achieves the highest lexical coverage
(close to 100% for a 65k vocabulary), German has the highest OOV rate,
about 5% with a 65k vocabulary. For a given speech technology (e.g. a
65k system) better results can thus be expected for English than for
German. In German, a major obstacle to high lexical coverage arises
from inflected forms and word compounding, for which morphological
decomposition could be effectively applied.

2 The newspaper text corpora compared are the Wall Street Journal
(American English), Le Monde (French), Frankfurter Rundschau (German)
from the ACL-ECI cdrom, Il Sole 24 Ore (Italian), and Nikkei
(Japanese).

language      English   Italian   French     German   Japanese
corpus        WSJ       Sole 24   Le Monde   FR       Nikkei
#words        37.2M     25.7M     37.7M      36M      180M
#distinct     165k      200k      280k       650k     623k
5k cover.%    90.6      88.3      85.2       82.9     88.0
20k cover.%   97.5      96.3      94.7       90.0     96.2
65k cover.%   99.6      99.0      98.3       95.1     99.2
20k-OOV%      2.5       3.7       5.3        10.0     3.8
65k-OOV%      0.4       1.0       1.7        4.9      0.8

Table 1: Comparison of WSJ, Il Sole 24 Ore, Le Monde, Frankfurter
Rundschau and Nikkei text corpora in terms of number of distinct words
and lexical coverage of the text data for different lexicon sizes. OOV
rates are shown for 20k and 65k lexica.
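
Coverage figures of the kind shown in Table 1 can be computed with a few lines of code. The sketch below assumes that the lexicon of size N consists of the N most frequent words of the text and that coverage is measured on that same text; the function name is hypothetical and the toy token list stands in for a newspaper corpus.

```python
from collections import Counter

def coverage_by_lexicon_size(tokens, sizes):
    """Percentage of running words covered by the N most frequent
    distinct words, for each N in sizes."""
    counts = Counter(tokens)
    ranked = [count for _, count in counts.most_common()]
    total = sum(ranked)
    return {n: 100.0 * sum(ranked[:n]) / total for n in sizes}

tokens = "a a a a b b b c c d".split()
coverage = coverage_by_lexicon_size(tokens, [1, 2, 4])
oov = {n: 100.0 - c for n, c in coverage.items()}
```

The OOV rate is the complement of the coverage, which is consistent with the table (for instance 97.5% coverage and 2.5% OOV for English with a 20k lexicon). Languages with many inflected or compound forms spread the same number of running words over more distinct forms, so the curve rises more slowly with N.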

More recently, within the German GlobalPhone project, a multilingual
read speech database comprising 15 languages (Arabic, Chinese,
Croatian, English, French, German, Italian, Japanese, Korean,
Portuguese, Russian, Spanish, Swedish, Tamil and Turkish) has been
collected. Using these data the University of Karlsruhe is working on
a multilingual LVSR system [47]. Their research efforts focus on
multilingual acoustic modeling and fast bootstrapping of acoustic
models for new languages. Speech recognition results have been
obtained for 6 languages (word error rates ranging from 10% to near
50%) using 10k vocabularies. Closed test sets have been used, obtained
by adding missing words to the vocabularies and assigning a low
probability to the corresponding unigrams in the LM. The multilingual
text material is still too limited for reliable language model
estimation.
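
The closed-vocabulary setup described above can be sketched as follows. The floor value and the renormalization step are assumptions made for this illustration; the text only states that the added unigrams receive a low probability.

```python
def close_vocabulary(unigram_probs, test_words, floor=1e-6):
    """Add every test word missing from the unigram LM with a small
    probability, then renormalize so that probabilities sum to one."""
    extended = dict(unigram_probs)
    for word in test_words:
        if word not in extended:
            extended[word] = floor   # hypothetical floor value
    norm = sum(extended.values())
    return {word: p / norm for word, p in extended.items()}

lm = close_vocabulary({"a": 0.5, "b": 0.5}, ["a", "c"])
```

With such a setup the reported error rates exclude OOV effects, so they are not directly comparable with open-test results such as those of the SQALE evaluation.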

Experiments in multilingual read speech recognition indicate that good
performance can be achieved across languages, provided that sufficient
training material is available (10-100 hours of speech, 50-200M
words).

4.2. Multilingual LVCSR using conversational speech

The CallHome program [14] (part of the DARPA Hub5 program) was
initiated in the US in 1995 in order to study conversational speech
between family members over long-distance telephone in a multilingual
context. Corpora were recorded in English, Mandarin, Japanese and
Spanish (with a variety of dialects) during 1995, and in Arabic
(colloquial Egyptian) and German during 1996. LDC provided the
multilingual data to participants. Word error rates reported in 1997
range from about 40% for English to around 60% for Spanish, Arabic,
Mandarin and German. As stated by G. Zavaliagkos [55], work on the
CallHome corpora has verified that current technology is largely
language-independent. The better results obtained in English can be
related to the relatively larger amount of training data available in
this language and perhaps to longer and more reliable expertise in
English system development. Nonetheless word error rates remain high
across the different languages, significantly higher than those
reported for read or prepared broadcast speech (around 20% word error
rate, DARPA Hub4 program). To measure the impact of speaking style
alone on recognition results, while controlling speaker, channel and
LM effects, an interesting experiment was carried out at SRI, as
reported in [14]. Conversational speech was recorded and then
transcribed. The same speakers were then invited to read the
transcriptions, once imitating spontaneous style and a second time in
pure read style. Word error rates of about 50% for the true
conversational style drop to about 40% for the simulated spontaneous
elocution, and to around 30% for the read version. Conversational
speech does not fit the spoken language modeling assumptions as well
as read speech (see Section 3). This is particularly true for the
articulated phone sequence assumption of the pronunciation lexicon.
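
Since the results in this section are given as word error rates, a minimal sketch of the measure may be helpful. It uses the usual definition, the word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words; whitespace tokenization is an assumption, and official evaluations apply additional text normalization before scoring.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate in percent between two whitespace-tokenized
    strings. The reference must contain at least one word."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return 100.0 * prev[-1] / len(ref)

wer = word_error_rate("a b c d", "a x c")
```

Here one substitution and one deletion over four reference words give a word error rate of 50%. Because insertions are counted, the measure can exceed 100%.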

Results are consistently disappointing across languages for
conversational speech. Whereas read or broadcast speech can be
considered as normative, being meant to be understood by a large
audience, familiar conversational speech exhibits a larger variety of
individual speaking styles. This may explain the discrepancy observed
between performance on read and on conversational speech. For the
CallHome languages about 15 hours of acoustic training data and about
150k words for language model estimation were available. Vocabulary
sizes ranged from about 10k to about 20k [12]. Experience gained from
conversational speech in English (using Switchboard) shows that
significant error reduction (i.e. better conversational speech
modeling) can be achieved when moving from 15 to 150 hours of speech
and from 150k to 2M words.

4.3.  Multilingual  Broadcast  Transcriptions 

The DARPA Hub4 program, introduced in 1995, concerns broadcast news
transcription.

Within the broadcast transcription program, data collection and corpus
design have become more efficient, as large amounts of news are
constantly available. Corpus transcription and annotation standards
[10] have been developed. Annotated corpora are easily created using
freeware transcription tools [7]. Manual broadcast transcription and
annotation can take 10 to 50 times real-time.

Whereas the main effort is centered on English sources, non-English
(multilingual) evaluations have been carried out for Spanish [25] and
Chinese systems [56], demonstrating the feasibility for other
languages. The best English results are below 20% word error rate.
Error rates on non-native speech (F5 condition [48]) are higher than
for the corresponding native condition (F0), but the F5 proportion
remains low in the overall test sets.

Automatically generated broadcast news transcripts can be used for
indexing or document retrieval tasks (NIST SDR program). These
research areas go in the direction of speech understanding. The
benefits of the broadcast news task for progress in speech recognition
technology are discussed in [33].

In Europe the EC is also sponsoring research on multilingual broadcast
transcriptions. As an example we can cite the LE4-OLIVE project
launched in 1998, which aims to support automated indexing of video
material by use of human language technologies and in particular
multilingual speech recognition. The prime interest of the OLIVE users
is to obtain efficient, detailed and direct access to their video
archives. The users in the OLIVE consortium are two television
stations, ARTE (Strasbourg, France) and TROS (Hilversum, Netherlands),
as well as the French national audio-video archive, INA/Inatheque in
Paris, France, and NOB, a large service provider for broadcasting and
TV productions (Hilversum, Netherlands). Technology development and
system implementation involve: TNO-TPD (Delft), the project
co-ordinator, supplying the core indexing and retrieval functionality;
VDA BV (Hilversum), building the video capturing software; the
University of Twente and the LT Lab of DFKI GmbH Saarbrücken,
responsible among others for the natural language technology; and
LIMSI-CNRS (Orsay, France) and Vecsys SA (Les Ulis, France),
developing and integrating the speech recognition modules,
respectively.

OLIVE makes use of speech recognition in English,
French  and  German  to  automatically  derive  transcriptions  of  the 
sound  tracks,  generating  time-coded  linguistic  elements  which 
serve  as  the  basis  for  text-based  retrieval  functionality.  Confi- 
dence scores  are  associated  with  each  hypothesized  word  to  al- 
low further  processing  steps  to  take  into  account  the  reliability 
of  the  candidates. 
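
The use of time-coded words and confidence scores for retrieval can be illustrated with a minimal sketch. The input format, a list of (word, start time in seconds, confidence) triples, and the threshold value are assumptions made for this example and do not describe the actual OLIVE data structures.

```python
from collections import defaultdict

def build_index(words, min_confidence=0.5):
    """Map each sufficiently reliable hypothesized word to the list of
    times (in seconds) at which it occurs in the sound track."""
    index = defaultdict(list)
    for word, start, confidence in words:
        if confidence >= min_confidence:
            index[word.lower()].append(start)
    return dict(index)

hypotheses = [
    ("Paris", 1.2, 0.9),
    ("news", 2.0, 0.3),   # low confidence: left out of the index
    ("paris", 7.5, 0.8),
]
index = build_index(hypotheses)
```

A query for "paris" then returns the time codes 1.2 and 7.5, which can be used to jump directly to the corresponding positions in the video archive.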

Taking advantage of the corpora available through the LDC, the speech
recognizer [18, 21] has been developed and tested on American English.
The acoustic models are trained on 150 hours of transcribed audio
data, with the language models trained on 200M words of broadcast news
transcriptions and 400M words of newspaper and newswire texts. Using
broadcast data collected in OLIVE, LIMSI has ported its American
English system to French. A port to German is underway.

Experiments with 700 hours of unrestricted broadcast news data
indicate that word error rates around 20% are obtained for American
English. Preliminary experiments in French and German indicate that
the word error rates are higher, which can be expected as these
languages are more highly inflected than English and less training
data are available. However, it has to be kept in mind that for the
purpose of indexing and retrieval a 100% recognition rate is not
necessary, since not every word will have to make it into the index,
and not every expression in the index is likely to be queried.
Research into the differences between text retrieval and spoken
document retrieval indicates that recognition errors do not add new
problems for the retrieval task [28].

The broadcast transcription testbed is particularly rich in varying
acoustic conditions, topics, domains and languages, with native and
non-native speakers. Significant progress in multilingual
interoperability can be expected from research in broadcast
transcriptions.

4.4.  Portability 

Porting a speech recognizer to a new language consists mainly in the
creation of the language-specific acoustic models, pronunciation
lexica and language models. As mentioned before, the acoustic
parameter extraction, the model estimation techniques and the search
engine may be considered as language-independent. Porting can thus
appear as a rather straightforward process, provided that sufficient
speech and text databases are available, together with either a
pronunciation lexicon or appropriate letter-to-sound rules for
pronunciation generation. In the previously described SQALE and
CallHome programs, multilingual resources were provided to the
different participants for system development. Porting efforts can
then be limited to a span of several months. Much of the demonstrated
progress in speech recognition and spoken language understanding over
recent years has been fostered by the availability of large, commonly
used corpora for system training and evaluation in different
languages.

But these resources, while constantly growing, are still lacking for
many human languages. Especially in military and intelligence
applications, interest in exotic languages may arise suddenly, and the
porting phase should be as short as possible.

4.4.1. Porting using language-dependent resources

In the following we relate some of our experience from the SQALE
project, where our read speech recognition systems for American
English and French were ported to British English and to German.
Language-dependent resources (transcribed speech, text material and
pronunciation dictionaries) were available to all partners.

For German the acoustic models were bootstrapped using
a mix of French and English models. German acoustic models
were then estimated from the PHONDAT read speech database,
available for research purposes from the University of Munich.
PHONDAT contains a variety of prompt types including phonetically
balanced sentences, a few short stories, isolated letters
and train timetable queries. There are a total of 15,000 sentences
from 155 speakers. The vocabulary is rather limited,
with only about 1700 different words, and the prompt texts are
quite different in style from the language model training material
(taken from newspaper texts). Despite this relative mismatch
between the acoustic data and the read newspaper task,
and despite the limited number of distinct lexical items, good
recognition performance was observed for German. But we
have to recall two important facts. First, the German system used
a 65k vocabulary to get acceptable lexical coverage, whereas
the systems for the other languages were still using 20k vocabularies.
Second, the SQALE test sets were designed to achieve similar
OOV rates of about 2% for all languages: with a 20k lexicon and
without OOV control on the test data, the OOV rate is 10% in German
(2.5% in American English). The OOV problem could be reduced
by decompounding compound words, as was done for
numbers during text normalization. Decompounding is however
a non-trivial task requiring a refined morphological analysis and
sometimes even semantic information. Many compounds can
be split into two or more items depending on the degree of
morphological analysis carried out. For example, consider the
following compound word occurring in the training texts:
Bundesbahnoberamtsrat (approximate translation:
Federal-Rail-Head-Office-Chief). The following decompositions
are possible and semantically correct:

Bundesbahnoberamtsrat → Bundes Bahn Ober Amts Rat
Bundesbahnoberamtsrat → Bundesbahn Ober Amtsrat
Bundesbahnoberamtsrat → Bundesbahn Oberamtsrat

Other decompositions such as:

Bundesbahnoberamtsrat → Bundes Bahnober Amtsrat
are possible, but semantically poor. This example clearly illustrates
that word compounding in German constitutes a source of OOV words,
as long as the recognition system considers a word to be an
item occurring between two spaces.
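
The ambiguity of decompounding can be made concrete with a small sketch. The code below is only an illustration under stated assumptions, not the procedure used in SQALE: it enumerates every way of splitting a compound into entries of a toy lexicon (TOY_LEXICON and the function name decompositions are hypothetical, and linking elements such as the "s" in "Bundes" and "Amts" are simply stored as part of the lexicon entries). It makes no attempt at morphological or semantic ranking, which is precisely the hard part discussed above.

```python
from typing import List, Set

# Hypothetical toy lexicon, lower-cased; a real system would need a
# morphological analyzer rather than a flat word list.
TOY_LEXICON: Set[str] = {
    "bundes", "bahn", "ober", "amts", "rat",
    "bundesbahn", "amtsrat", "oberamtsrat", "bahnober",
}


def decompositions(word: str, lexicon: Set[str]) -> List[List[str]]:
    """Return every segmentation of `word` into consecutive lexicon entries."""
    word = word.lower()
    results: List[List[str]] = []

    def split(start: int, parts: List[str]) -> None:
        # A complete segmentation has consumed the whole word.
        if start == len(word):
            results.append(parts)
            return
        # Try every lexicon entry that begins at position `start`.
        for end in range(start + 1, len(word) + 1):
            piece = word[start:end]
            if piece in lexicon:
                split(end, parts + [piece])

    split(0, [])
    return results


if __name__ == "__main__":
    for parts in decompositions("Bundesbahnoberamtsrat", TOY_LEXICON):
        print(" ".join(parts))
```

With this toy lexicon the four decompositions listed above are all produced, together with further candidates, and nothing in the purely lexical search distinguishes the semantically poor ones from the good ones.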

German system development would have benefited
from a reliable morphological analyzer, both for the quality of
the vocabulary (better coverage) and for the LM (more data to
estimate N-grams). As mentioned before, even the pronunciations
could have been improved, as a lack of consistency may occur
when a given morpheme is observed in a long list of compounds.
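
As a reminder of how the lexical coverage figures quoted in this section are obtained, the following minimal sketch selects the N most frequent training words as the vocabulary and measures the OOV rate on a test text. The function names and the whitespace tokenization are assumptions made for illustration; the actual SQALE text normalization is considerably more elaborate.

```python
from collections import Counter
from typing import Iterable, List, Set


def top_n_vocabulary(training_tokens: Iterable[str], n: int) -> Set[str]:
    """Select the n most frequent word forms of the training text."""
    counts = Counter(training_tokens)
    return {word for word, _ in counts.most_common(n)}


def oov_rate(test_tokens: List[str], vocabulary: Set[str]) -> float:
    """Percentage of running test words that are not in the vocabulary."""
    if not test_tokens:
        raise ValueError("test text is empty")
    misses = sum(1 for word in test_tokens if word not in vocabulary)
    return 100.0 * misses / len(test_tokens)


if __name__ == "__main__":
    # Tiny illustrative texts; words are items between spaces, as in the text.
    train = "a a a b b c".split()
    test = "a b c d".split()
    vocab = top_n_vocabulary(train, 2)
    print(f"OOV rate: {oov_rate(test, vocab):.1f}%")
```

Under the assumption that a decompounding step is available, running it on both the training and test texts before this computation would show its effect on coverage for a fixed vocabulary size.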

To conclude, porting to a new language
can be very fast if all resources are available. A baseline system
can then be produced in a short time. In a second step, developments
can be carried out to better account for language
specificities: typical pronunciation variants, regional accents,
stemming, decompounding for agglutinative languages, etc. Here
years can be spent improving on the baseline performance.

4.4.2.  Lacking training data for the new language: cross-lingual approaches

A tentative definition of cross-lingual modeling is the following:
resources from one or multiple source languages are
used to estimate models for a new target language. Cross-lingual
approaches can be applied to acoustic phone modeling, as similar
sounds are often shared across different languages. A relatively
large number of research efforts aim at defining multilingual
or language-independent acoustic model sets [47, 32, 31]. The
availability of language-independent acoustic models reduces the
problem of lacking acoustic data in the target language.

For lexical and language modeling, however, language-dependent
resources remain mandatory, at least in the present
state of the art. Progress may be achieved through research areas
including machine translation, multilingual indexing and speech
understanding.

The problem of insufficient training material is addressed
in [55]. According to the author, the dominant factor with respect
to performance is the amount of training data available.
The author proposes using automatically transcribed test
data of the new language to adapt the acoustic models to the
new language. The proposed method shows a slight but consistent
gain in word accuracy when a subset of the automatically
transcribed data, selected using a confidence measure criterion,
is used to adapt the acoustic and language models.
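
The selection step of such an unsupervised adaptation scheme can be sketched as follows. This is a hedged illustration only: the exact confidence measure and selection rule of [55] are not given here, so the mean word confidence per utterance, the threshold value, and all names (Hypothesis, select_for_adaptation) are assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    """Automatic transcription of one utterance (hypothetical structure)."""
    utterance_id: str
    words: List[str]
    word_confidences: List[float]  # one score in [0, 1] per word


def select_for_adaptation(hypotheses: List[Hypothesis],
                          threshold: float = 0.9) -> List[Hypothesis]:
    """Keep utterances whose mean word confidence reaches the threshold.

    The mean-confidence rule and the default threshold are assumptions;
    other criteria (minimum word confidence, posterior-based scores)
    would fit the same interface.
    """
    selected = []
    for hyp in hypotheses:
        if not hyp.words:
            continue  # nothing to adapt on
        mean_conf = sum(hyp.word_confidences) / len(hyp.word_confidences)
        if mean_conf >= threshold:
            selected.append(hyp)
    return selected


if __name__ == "__main__":
    hyps = [
        Hypothesis("utt1", ["guten", "tag"], [0.95, 0.97]),
        Hypothesis("utt2", ["bahn", "rat"], [0.50, 0.99]),
    ]
    kept = select_for_adaptation(hyps)
    print([h.utterance_id for h in kept])
```

The selected utterances, paired with their automatic transcriptions, would then serve as adaptation data for the acoustic and language models.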


5.  CONCLUSION 

We can consider that present recognition systems are potentially
multilingual, as the same family of methods and algorithms applies
to the development of recognizers in a large variety of languages.

Depending on the level of spoken language representation,
a greater or lesser degree of language dependency is observed.
Whereas the acoustic parameter front-end can be considered
mostly language-independent, words and their pronunciations
are completely language-dependent. Successful porting to a new
target language then requires appropriate language-specific resources,
the most important of which are text material and pronunciation
lexica. The availability and size of these resources are
strongly linked to the final recognizer’s performance. Developing
multilingual resources is expensive, even if dedicated
tools exist and speed up the transcription and annotation process.
Porting an ASR system to a new target language requires,
as a minimum, text material for language modeling and
pronunciations for the vocabulary. Baseline performance can
then be improved by increasing the volume of training
material and/or by adding language-specific knowledge to the
various components [52]. Cross-domain research remains an
important area, to ensure reusability of these resources when
moving to new application domains and to increase ASR interoperability.
To overcome the problem of insufficient or missing
data, researchers are developing interpolation methods to combine
corpora. Language specificities, when accounted for properly,
will contribute to optimizing the recognizer’s performance
for the new language.

Other research directions concern more language-independent
approaches to speech recognition, and more
specifically to acoustic modeling. The IPA phone symbol
set can theoretically be used to train a collection of
language-independent acoustic phone models covering all
possible sounds. Language-independent approaches are being
investigated [47, 32, 31], and have shown some success
in porting systems to new languages. Language-independent
models have proven useful in bootstrapping recognizers for a
new language. Comparative studies show that a small corpus of
language-specific acoustic data (1 hour) then rapidly makes it
possible to train or adapt better acoustic models [31].

Lexical modeling, comprising the definition of the recognizer’s
vocabulary (word list) with corresponding pronunciations,
relies on completely language-dependent resources. Vocabularies
are often chosen as the most frequent words occurring in training
text corpora which also ensure good coverage of the application.
To overcome a lack of target text corpora for vocabulary
definition, bilingual (multilingual) dictionaries can help
to port vocabularies from source to target languages.
But language-dependent resources are necessary for word-level
modeling (target language text corpora or multilingual dictionaries,
letter-to-sound rules, ...). Statistical language modeling
for a new target language generally requires huge amounts of
text. New challenging research directions joining the
domains of machine translation and cross-language information
retrieval may contribute to increasing multilingual interoperability
in the future.

Multilingual interoperability in automatic speech recognition
can be seen as a goal, a guiding principle to orient
research away from purely language-dependent towards more
language-independent questions. This is an important goal to
strive for. As the number of written languages remains relatively
low, we can imagine having baseline resources available
for a large proportion of written languages in the near future.
An important research issue then consists in defining and developing
these resources and generic corpora, which allow for
easy adaptation across domains and languages. The availability
of these resources for a large proportion of the spoken/written
languages will make it possible to judge the multilingual
capabilities of present speech recognition technology. As underlined
by V. Zue in his keynote paper at Eurospeech ’97 [57], real deployment
of spoken language technology cannot take place without adequately
addressing this problem of portability.